# what are you iterating over? The vector from -10:10
items_to_iterate_over <- c(-10:10)

# pre-allocate the results
out <- rep(0, length(items_to_iterate_over))

# write the iteration statement --
# we'll use indices so we can store the output easily
for (i in seq_along(items_to_iterate_over)) {
  # do something
  # we capture the median of three random numbers
  # from normal distributions with various means
  out[[i]] <- median(rnorm(n = 3, mean = items_to_iterate_over[[i]]))
}
Module 8: Iteration
In this lesson, we learn how to implement looping techniques to minimize repeated code and improve efficiency.
Iteration
Download a copy of Module 8 slides
Lab 8
General Guidelines:
You will encounter a few functions we did not cover in the lecture video. This will give you some practice using a new function for the first time. You can try the following steps:
- Start by typing ?new_function in your Console to open up the help page
- Read the help page of this new_function. The description might be too technical for now. That’s OK. Pay attention to the Usage and Arguments, especially the argument x or x, y (when two arguments are required)
- At the bottom of the help page, there are a few examples. Run the first few lines to see how it works
- Apply it in your lab questions
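As a minimal sketch of these steps, here is what they might look like with sd() standing in for new_function (the choice of sd() and the name sd_args are assumptions for illustration; any function works the same way):

```r
# Step 1: open the help page (sd() is a stand-in for new_function)
?sd

# Step 2: look at the Usage and Arguments sections;
# formals() lists the same argument names from the Console
sd_args <- names(formals(sd))
print(sd_args)

# Step 3: run the examples from the bottom of the help page
example(sd)
```

Once the examples make sense, try the function on a small vector of your own before using it in a lab question.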
It is highly likely that you will encounter error messages while doing this lab. Here are a few steps that might help you get through them.
- Locate which line is causing this error first
- Check if you may have a typo in the code. Sometimes another person can spot a typo faster than you.
- If you entered the code without any typos, try googling the error message
- Scroll through the top few links to see if any of them help
- Try working on the next few questions while waiting for answers from the TAs
Warm-up
Recall that for-loops are iterators that help us repeat tasks while changing inputs. The most common structure for your code will look like the code at the top of this page. This can be simplified if you are not storing results.
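As a rough sketch of the simplified form, assuming we only want to see each result and not keep it, we can loop over the values directly and skip the pre-allocation step (the loop variable name item is our own choice):

```r
items_to_iterate_over <- c(-10:10)

# Loop over the values themselves; with nothing to store,
# we need neither indices nor a pre-allocated output vector
for (item in items_to_iterate_over) {
  print(median(rnorm(n = 3, mean = item)))
}
```

The trade-off is that the results are gone once they are printed, so this form only suits cases where you do not need the output later.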
Writing for-loops
- Write a for-loop that prints the numbers 5, 10, 15, 20, 250000.
- Write a for-loop that iterates over the indices of x and prints the ith value of x.
x <- c(5, 10, 15, 20, 250000)
# replace the ... with the relevant code
for (i in ... ){
  print(x[[...]])
}
- Write a for-loop that simplifies the following code so that you don’t repeat yourself! Don’t worry about storing the output yet. Use print() so that you can see the output. What happens if you don’t use print()?
sd(rnorm(5))
sd(rnorm(10))
sd(rnorm(15))
sd(rnorm(20))
sd(rnorm(25000))
- Adjust your for-loop to see how the sd changes when you use rnorm(n, mean = 4)
- Adjust your for-loop to see how the sd changes when you use rnorm(n, sd = 4)
- Now store the results of your for-loop above in a vector. Pre-allocate a vector of length 5 to capture the standard deviations.
Vectorization vs for-loops
Recall, vectorized functions operate on a vector item by item. It’s like looping over the vector! The following for-loop is better written vectorized.
Compare the loop version
names <- c("Marlene", "Jacob", "Buddy")
out <- character(length(names))

for (i in seq_along(names)) {
  out[[i]] <- paste0("Welcome ", names[[i]])
}
to the vectorized version
names <- c("Marlene", "Jacob", "Buddy")
out <- paste0("Welcome ", names)
The vectorized code is preferred because it is easier to write and read, and is possibly more efficient.
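When you rewrite a loop as vectorized code, it helps to confirm that both versions give the same answer. A minimal sketch using the welcome example (the names out_loop and out_vectorized are ours, chosen so the two results can sit side by side):

```r
names <- c("Marlene", "Jacob", "Buddy")

# Loop version
out_loop <- character(length(names))
for (i in seq_along(names)) {
  out_loop[[i]] <- paste0("Welcome ", names[[i]])
}

# Vectorized version
out_vectorized <- paste0("Welcome ", names)

# identical() returns TRUE only when the two results match exactly
identical(out_loop, out_vectorized)
```

The same check works for any of the rewrites below: keep the loop result under one name, the vectorized result under another, and compare.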
- Rewrite your first for-loop, where you printed 5, 10, 15, 20, 250000, as vectorized code.
- Rewrite this for-loop as vectorized code:
radii <- c(0:10)

area <- double(length(radii))

for (i in seq_along(radii)) {
  area[[i]] <- pi * radii[[i]] ^ 2
}
- Rewrite this for-loop as vectorized code:
radii <- c(-1:10)

area <- double(length(radii))

for (i in seq_along(radii)) {
  if (radii[[i]] < 0) {
    area[[i]] <- NaN
  } else {
    area[[i]] <- pi * radii[[i]] ^ 2
  }
}
Extension
Simulating the Law of Large Numbers
The Law of Large Numbers says that as sample sizes increase, the mean of the sample will approach the true mean of the distribution. We are going to simulate this phenomenon!
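To get a feel for the idea before writing any loops, here is an illustrative sketch using simulated coin flips, where the true mean is 0.5. The coin-flip setup, the seed, and the names flips and running_mean are assumptions for this example and are not part of the lab.

```r
set.seed(1)

# 10,000 fair coin flips coded as 0 (tails) and 1 (heads); the true mean is 0.5
flips <- rbinom(n = 10000, size = 1, prob = 0.5)

# Mean of the first 1 flip, the first 2 flips, the first 3 flips, and so on
running_mean <- cumsum(flips) / seq_along(flips)

# Compare the running mean at a few increasing sample sizes
running_mean[c(10, 100, 1000, 10000)]
```

The early values can sit far from 0.5, while the later ones should settle close to it. The lab below explores the same behavior with draws from a normal distribution.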
We’ll start by making a vector of sample sizes from 1 to 50, to represent increasing sample sizes.
Create a vector called sample_sizes that is made up of the numbers 1 through 50. (Hint: You can use seq() or : notation).
We’ll make an empty tibble to store the results of the for loop:
estimates <- tibble(n = integer(), sample_mean = double())
Write a loop over the sample_sizes you specified above. In the loop, for each sample size you will:
- Calculate the mean of a sample from the random normal distribution with mean = 0 and sd = 5.
- Make an intermediate tibble to store the results.
- Append the intermediate tibble to your tibble using bind_rows().
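If the append pattern is new to you, here is a small sketch of it on unrelated data: each number from 1 to 3 is stored next to its square. The tibble squares, its columns, and the variable this_row are hypothetical names for illustration. The sketch also assumes the tibble and dplyr packages are installed.

```r
library(tibble)  # tibble()
library(dplyr)   # bind_rows()

# Empty tibble with the columns we plan to fill
squares <- tibble(value = integer(), squared = double())

for (v in 1:3) {
  # One-row tibble holding this iteration's result
  this_row <- tibble(value = v, squared = v^2)
  # bind_rows() returns a new tibble, so assign the result back
  squares <- bind_rows(squares, this_row)
}

squares
```

Note the assignment on the bind_rows() line: without it, the combined tibble is created and then thrown away.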
set.seed(60637)
for (___ in ___) {
  # Calculate the mean of a sample from the random normal distribution with mean = 0 and s
  ___ <- ___
  # Make a tibble with your estimates
  this_estimate <- tibble(n = ___, sample_mean = ___)
  # Append the new rows to your tibble
  ___ <- bind_rows(estimates, ___)
}
We can use ggplot2 to view the results. Fill in the correct information for the data and x and y variables, so that the n column of the estimates tibble is plotted on the x-axis, while the sample_mean column of the estimates tibble is plotted on the y-axis.
# your data goes in the first position
___ %>%
  ggplot(aes(x = ___, y = ___)) +
  geom_line()
- As the sample size (n) increases, does the sample mean become closer to 0, or farther away from 0?
Rewrite the loop code without looking at your previous code and use a wider range of sample sizes. Try several different sample size combinations. What happens when you increase the sample size to 100? 500? 1000? Use the seq() function to generate a sensibly spaced sequence.
set.seed(60637)
sample_sizes <- ___
estimates_larger_n <- ___

for (___ in ___) {
  ___ <- ___
  ___ <- ___
  ___ <- ___
}

___ %>%
  ggplot(___(___ = ___, ___ = ___)) +
  geom_line()
- How does this compare to before?
Extending Our Simulation
Looking at your results, you might think a small sample size is sufficient for estimating a mean, but your data had a relatively small standard deviation compared to the mean. Let’s run the same simulation as before with different standard deviations.
Do the following:
Create a vector called population_sd of length 4 with values 1, 5, 10, and 20 (you’re welcome to add larger numbers if you wish).
Make an empty tibble to store the output. Compared to before, this has an extra column for the changing population standard deviations.
Write a loop inside a loop over population_sd and then sample_sizes.
Then, make a ggplot graph where the x and y axes are the same, but we facet (aka we create small multiples of individual graphs) on population_sd.
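If you have not written a loop inside a loop before, the sketch below shows the order in which the iterations run. It uses made-up inputs (outer_values, inner_values) and a made-up output vector (combos), all of them assumptions for illustration and unrelated to the simulation itself.

```r
# Hypothetical inputs for illustration
outer_values <- c("a", "b")
inner_values <- 1:3

# Pre-allocate one slot per combination of outer and inner value
combos <- character(length(outer_values) * length(inner_values))
k <- 0

for (outer in outer_values) {
  for (inner in inner_values) {
    # The inner loop runs from start to finish for each outer value
    k <- k + 1
    combos[[k]] <- paste0(outer, inner)
  }
}

combos
```

Every inner value is paired with the first outer value before the outer loop moves on, so the body runs once per combination.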
set.seed(60637)
population_sd <- ___
# use whatever you came up with in the previous part
sample_sizes <- ___
estimates_adjust_sd <- ___

for (___ in ___){
  for (___ in ___) {
    ___ <- ___
    ___ <- ___
    ___ <- ___
  }
}

___ %>%
  ggplot(___) +
  geom_line() +
  facet_wrap(~population_sd) +
  theme_minimal()
How do these estimates differ as you increase the standard deviation?